Independence and Statistical Inference in 
Clinical Trial Designs: A Tutorial Review 



Sanford Bolton, PhD 



The requirements for statistical approaches to the de- 
sign, analysis, and interpretation of experimental data 
are now accepted by the scientific community. This is of 
particular importance in medical studies where public 
health consequences are of concern. Investigators in the 
clinical sciences should be cognizant of statistical principles
in general but should always be wary of pursuing
their own analyses and engage statisticians for
data analysis whenever possible. Examples of circum- 
stances that require statistical evaluation not found in 
textbooks and not always obvious to the lay person are 
pervasive. Incorrect statistical evaluation and analyses 
in such situations will result in erroneous and potentially
seriously misleading interpretation of clinical data.
Although a statistician may not be responsible for any 
misinterpretations in such unfortunate circumstances, 
the quote often cited about statisticians and "damned
liars" may appear to be more truth than fable. This 
article is a tutorial review and describes a common 
misuse of clinical data resulting in an apparently large 
sample size derived from a small number of patients. 
This mistake is a consequence of ignoring the depen- 
dency of results, treating multiple observations from a 
single patient as independent observations. 

Assumptions regarding the nature of data are 
often crucial to statistical analysis and inter- 
pretation of experimental results. When using sta- 
tistical procedures, many scientists are very con- 
cerned with assumptions of normality of data 
distributions and equality of variance in compar- 
ative treatment groups. Although these are impor- 
tant issues, lack of normality is usually of little 
concern if we are dealing with means. Means tend 
to be normal, particularly if the sample size is 
reasonably large. Lack of equality of variance in 
the groups being compared can be troublesome, 
but if variances are not too different, the conse- 
quences are not dire, and if necessary alternative 
analyses can be used to compensate for variance 
heterogeneity (heteroscedasticity). 

A more crucial assumption in many statistical 
analyses is that the treatments being compared are 
independent, a feature of data that is often over- 
looked. In lay terms, independence means that an 
experimental observation in a study does not in- 
fluence the outcome of another experimental ob- 
servation. Mathematically, this can be expressed 
as P(X|Y) = P(X). In plain words, this says that if 
the probability of an outcome X, given the outcome Y,
is equal to the probability of X absent the
outcome Y, then X and Y are independent. For 
example, if a patient (A) has blood international 
normalized ratio (INR) values (a measure of clot- 
ting time) close to 3 IU and another patient (B) has 
values close to 2 IU, the readings for patient A
will be independent of those for patient
B. However, the multiple readings for patient A are 
not independent in the context of a group of aver- 
age INR readings for N patients. Because values for 
patient A are close to 3 IU, we would expect the 
readings to be related to the clotting characteristics 
of that particular patient. 
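
The INR example can be illustrated with a small simulation. This is only a sketch under assumed values (patient mean INR varying around 2.5 with a standard deviation of 0.5, and readings on one patient scattering around that patient's mean with a standard deviation of 0.2); none of these numbers come from a real study. It compares how strongly one reading predicts another when both come from the same patient and when they come from different patients.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)  # fixed seed so the sketch is reproducible

# Assumed values: patient mean INR differs between patients (SD 0.5),
# and readings on one patient scatter around that mean (SD 0.2).
n_patients = 200
patient_means = [random.gauss(2.5, 0.5) for _ in range(n_patients)]
first = [random.gauss(m, 0.2) for m in patient_means]
second = [random.gauss(m, 0.2) for m in patient_means]

# Two readings taken from the same patient
r_within = pearson(first, second)

# A reading from one patient paired with a reading from a different patient
shifted = second[1:] + second[:1]
r_between = pearson(first, shifted)

print(f"same patient:       r = {r_within:.2f}")
print(f"different patients: r = {r_between:.2f}")
```

Under these assumptions, readings from the same patient are strongly correlated, whereas readings from different patients are essentially uncorrelated, which is the sense in which repeated readings on one patient are not independent within a group of patients.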

Before providing some examples of how conclu- 
sions of experiments can be distorted when the con- 
cept of independence is ignored, one should appre- 
ciate that independence cannot be looked at in 
isolation. The experimental design must be detailed 
along with a thorough understanding of the nature of 
the experiment and its objectives to assess the pres- 
ence or lack of independence. For example, if the 
purpose of an experiment is to estimate the rate or 
incidence of adverse events for a single patient, the 
observations may or may not be independent, de- 
pending on one's understanding of the physiologic 
processes or the objective of the study. To assess the 
incidence of gastrointestinal upset in a patient who 
takes a single dose of an analgesic every few weeks, 
the events may be considered independent. One 
could assume that the incidence of an upset stomach 
on one occasion is not influenced by previous events 
and would not influence an observation at a future 
time. 

Of course, this is not necessarily true, and if there 
is some doubt, the data could be examined for trends 
and associations, using a correlation or regression 
analysis, for example. Cholesterol readings taken at 
random and at reasonably long intervals for a single 
patient might be considered independent events for 
that patient. In the latter case, we could assign some 
average cholesterol reading for the patient along 
with some measure of variability, such as the stan- 
dard deviation, estimated from multiple observa- 
tions on the patient. If the cholesterol reading were 
taken twice a year, directly after Christmas and Eas- 
ter, one may question the independence of the read- 
ings because of potential elevated levels resulting 
from excessive feasting. 

EXAMPLES OF ABUSES OF THE CONCEPT OF 
INDEPENDENCE 

In a different context, consider the number of adverse
events observed after four doses of an analgesic, given
on four separate occasions, in patients who are each
assigned to one of two treatment groups.
Are the adverse events independent? It
would appear that the adverse events for one patient 
would be independent of those for another patient, 
whether or not the other patient is in the same treat- 
ment group. However, in the context of this study, 
the multiple adverse events for an individual patient 
are not independent. 

For example, a patient who is prone to adverse 
events is likely to have a large incidence of adverse 
events and vice versa. In this case, the experimental 
unit is the patient. Each patient has a score related to 
the incidence of adverse events, and the sample size 
relates to the number of patients, not the total num- 
ber of observations. In the former case, where we 
were looking at a single patient, the experimental 
unit is the adverse event and the sample size is the 
number of events. In the latter case, the patient is the 
experimental unit and the sample size is the number 
of patients. 

This can be translated into statistics when we per- 
form a statistical test. For the observations of a single 
patient, we might say, for example, that 9 times in 15 
administrations an adverse event was observed, a 
rate of 60%. Thus, this patient has (approximately) a 
60% chance of having an adverse event after admin- 
istration of the analgesic. In the latter case where the 
patient is the experimental unit, the situation is not 
as clear, and the analysis depends on the question or 
objective. If the goal is to compare the proportion of 
patients who have at least one adverse event, we 
would simply count those patients with at least one
adverse event. If we were also interested in the actual
number of occasions that an adverse event was
observed for each patient, we might give each patient 
a score of 0 to 4, depending on how many of the four 
occasions an adverse event was noted. We could 
then perform a test comparing the average result of 
the two groups, such as a t test, where the sample 
size is the number of patients. 
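
A minimal sketch of this patient-level analysis follows. The scores are hypothetical, invented only for illustration, and the helper `welch_t` is my own name for a standard two-sample t statistic that allows unequal variances. The point is simply that each patient contributes one score and that the sample size entering the test is the number of patients.

```python
import math
from statistics import mean, variance

# Hypothetical data: for each patient, the number of the four occasions
# on which an adverse event was noted (a score of 0 to 4).
scores_a = [0, 1, 0, 2, 1, 0]
scores_b = [1, 3, 2, 4, 2, 1]

def welch_t(x, y):
    """t statistic for two independent groups (unequal variances allowed)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se

n_a, n_b = len(scores_a), len(scores_b)  # sample sizes are numbers of patients
t_stat = welch_t(scores_a, scores_b)

print(f"patients per group: {n_a} and {n_b}")
print(f"administrations per group: {4 * n_a} and {4 * n_b} (not the sample size)")
print(f"t statistic on patient scores: {t_stat:.2f}")
```

The statistic would then be referred to a t distribution with degrees of freedom based on the number of patients; that step is omitted here because it needs a statistical library.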

Although this seems simplistic and straightforward,
I (and others) have often seen analyses that use a
sample size equal to the total number of administrations,
4N (four times the number of patients in this example),
which exaggerates the apparent sample size fourfold.
This gives the misleading impression that the study had
four times as many patients, which would severely
influence the statistical conclusions. Specific examples
are given below to illustrate this point.

Although this is obviously an incorrect analysis, 
the concept may be made more clear if we consider 
an extreme example, an experiment where only one 
patient is in each treatment group. Suppose that one 
patient had four adverse events, one at each admin- 
istration, and the other patient had no adverse 
events. Would one conclude that the treatments are 
different? Could we just be seeing an example of a 
very susceptible patient compared with a more tol- 
erant patient rather than a treatment difference? Con- 
sider another example where there are three patients 
in each of two groups, each receiving the drug 10 
times. Five of the six patients have no adverse events 
and one has 10 adverse events. Should we make our 
comparison as 0/3 versus 1/3 (the numerator is the 
number of patients with at least one adverse event) 
or 0/30 versus 10/30 (the denominator is the total 
number of administrations)? Would it matter? Cer- 
tainly! 0/3 versus 1/3 is not even close to significant, 
whereas 0/30 versus 10/30 is highly significant (P <
0.001), using a test such as Fisher's exact test. The
application of the latter test is incorrect; the sample 
size is greatly exaggerated. 
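
The two comparisons can be checked with a short calculation. The sketch below implements a two-sided Fisher's exact test directly from the hypergeometric distribution; the function name `fisher_exact_two_sided` is my own, and the two-sided convention used (summing all tables no more probable than the observed one) is one common choice among several.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x events in the first row
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Patients as units: 0 of 3 versus 1 of 3 with at least one adverse event
p_patients = fisher_exact_two_sided(0, 3, 1, 2)

# Administrations (incorrectly) as units: 0 of 30 versus 10 of 30
p_administrations = fisher_exact_two_sided(0, 30, 10, 20)

print(f"0/3 versus 1/3:    P = {p_patients:.3f}")
print(f"0/30 versus 10/30: P = {p_administrations:.5f}")
```

The same six patients yield a P value of 1 when patients are the units and a P value below 0.001 when the 60 administrations are treated as independent.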

Perhaps this problem can be more easily visualized
by the often-used example of two groups of
rats subjected to two treatments, where each treatment
group is housed in its own cage. The
lack of independence is evident from the interactions
of the rats in each cage and how they affect each
other. One or two aggressive rats can influence the
results of all of the caged animals. Experimenters
know to house animals separately in such experiments.
In fact, a cage is a single experimental unit whether it
houses one rat or 50 rats. Note the analogy to
the example of multiple measurements for each
patient in a study.

EXAMPLES 

This problem has been discussed often in the litera- 
ture. Dr. Donald Mainland was a man of particular 
practicality as well as a noted statistician and physician.
In his book, Elementary Medical Statistics, 1
Dr. Mainland spends some time on this problem and 
the pitfalls associated with it. He says "... many 
research workers are rather vague about their sam- 
pling units and the need for their independence." 1 
He gives an example of a comparison of two toothpastes
(A and B), one given to each of two boys. One
boy has eight carious teeth and the other none. He 
concludes that "... this looks like an impressive 
difference, but we need no profound knowledge of 
dentistry or of statistics to realize that it provides no 
adequate evidence that the difference in toothpastes 
was responsible." 1 Note the analogous example to 
adverse events given earlier in this paper. Mainland 
continues "it is boys, not teeth, that are the sampling 
units, and there is only one sampling unit in each of 
the (A) and (B) samples — no true replicates." 1 He 
gives an example of a study of 36,196 teeth in 1,870 
children, approximately 20 teeth per child. Each 
tooth was considered as an independent piece of 
information in the publication. Mainland calls this a
"spurious enlargement of samples," "spurious replication,"
or "counting the same thing over again." 1
Children are the proper sampling units. 

Mainland also discusses the oft-quoted "caged an- 
imal" example, using a comparison of diets in pigs in 
which each treatment group is in a pen. Again, he 
says, there is no replication. There are only two 
observations, one on each diet; the pen is the sam- 
pling unit, as noted above. Another similar example 
is in group discussions. 

He discusses an experiment testing the efficacy of 
group therapy using patients with a chronic disease. 
The individuals in the group are not independent, 
perhaps being affected by "one or two dominant . . .
persons who spread a contagion." 1 A similar setting
is the use of panels, or focus groups, to discuss 
consumer behavior for the purpose of developing a 
marketing strategy for consumer products. In this 
scenario, a group of people are brought together to 
express views on a product or idea. Often, one per- 
son may be dominant and unduly influence the 
opinions of the other panelists. 

Mainland gives other examples, but gets right to 
the point, defining the analysis of multiple observations
from individual units (patients, for example) as
mixed sampling. He contends that one has to be very 
careful about the objective of the experiment. Are we 
interested in a population of patients or in a popu- 
lation of readings from an individual patient? In fact, 
what Dr. Mainland noted more than 30 years ago is
still true today. We still see examples of misuse of
"mixed sampling," where differences between individual
units (patients) are ignored. In particular, he
says, "we should become very suspicious if the num- 
bers of readings differ in different subjects even if 
there is no mixture of sampling units." 1 

Another noted medical statistician, Dr. Theo- 
dore Colton of Harvard Medical School, discusses 
the same problem in his book Statistics in Medi- 
cine 2 under the heading "Mishandling of Replicate 
Data." He says "often an investigator obtains rep- 
licate observations and fails to account for this in 
his analysis. In addition to incorrect analysis, the 
reader is often misled into believing there are 
many many more observations than there actually 
are." 2 He continues, "data analysis that fails to 
distinguish between observations on the same in- 
dividual and observations on different individuals 
is entirely meaningless." 2 Colton's example of this 
misuse of data considers the average blood pres- 
sure of a group of 10 patients with hypertension 
who are taking a particular treatment. Each patient 
is seen twice a week for 8 weeks (16 visits) and has 
five readings at each visit, for a total of 80 readings 
per patient. Certainly the 800 readings do not rep- 
resent 800 patients. Colton states that "obviously, 
one cannot simply proceed to analyze the 800 
observations as if they were derived from 800 different
patients . . . . The assessment of the drug's
effect involves a statistical inference from the sam- 
ple of 10 to a larger population of hypertensive 
subjects." 2 Colton concludes that "a moral to be 
derived from this example is that care should be 
taken to avoid being misled by great masses of 
observations. Upon close scrutiny, these masses 
may often vanish . . . . The following situations 
involving great masses of data are prone to such 
mishandling and misinterpretations: episodes, at- 
tacks or exacerbation of a disease . . . " 2 This also 
may include the incidence of adverse effects and 
lack of efficacy or toxic effects due to medication 
(see below). 
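
Colton's blood pressure example lends itself to a short simulation. The sketch below keeps his design (10 patients, 80 readings each) but the blood pressure values are assumed for illustration only: patient means vary with a standard deviation of 10 mm Hg around 150 mm Hg, and readings on one patient vary with a standard deviation of 5 mm Hg. It compares the standard error of the group mean obtained by treating all 800 readings as independent with the one obtained by using one mean per patient.

```python
import math
import random
from statistics import mean, stdev

random.seed(2)  # fixed seed so the sketch is reproducible

# Assumed values for illustration: true patient means differ (SD 10 mm Hg),
# and readings on one patient scatter around that mean (SD 5 mm Hg).
n_patients, readings_per_patient = 10, 80
readings = []
for _ in range(n_patients):
    patient_mean = random.gauss(150, 10)
    readings.append([random.gauss(patient_mean, 5)
                     for _ in range(readings_per_patient)])

# Incorrect: treat all 800 readings as independent observations
pooled = [r for patient in readings for r in patient]
naive_se = stdev(pooled) / math.sqrt(len(pooled))

# Patient as the unit: one mean per patient, n = 10
patient_means = [mean(patient) for patient in readings]
patient_se = stdev(patient_means) / math.sqrt(n_patients)

print(f"standard error with n = {len(pooled)} readings: {naive_se:.2f}")
print(f"standard error with n = {n_patients} patients:  {patient_se:.2f}")
```

Under these assumptions the standard error based on 800 readings is several times smaller than the one based on 10 patients, so the precision of the estimate is badly overstated when the readings are pooled.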

It is clear from these examples that this misuse of 
statistics has existed for some time and continues to 
exist despite continued discussion and warning of 
its misuse. Some years ago, I was dismayed by an 
article that appeared in a highly reputable journal, 
which showed a particular drug to be responsible for 
increased adverse events (in this case it was in- 
creased serum potassium) using the same "masses of 
data," "mixture of sampling units" approach. There 
were many observations repeated on a few subjects, 
with no information on how many times each sub- 
ject's potassium level was measured. 

Recently, I was presented with a paper from a 
reputable journal, 11 which had exactly the problems 
discussed in this article. In fact, the mixture of sam- 
pling units misanalysis was repeated more than once 
on various experimental outcomes. The problems in
the analysis, including an erroneous calculation,
were reviewed in a previous publication. 4 Some of
the relevant details for one of the most important
analyses on which the conclusions of the study were
based will follow, as yet another example of the
misuse of "masses of data." In particular, although
the naive error in the original article seemed to have
been made in good faith, the conclusions of the
article were subsequently used to promote one product
and to suggest potential deficiencies in a competing
product by a knowledgeable concerned party.

The paper in question* describes a retrospective
study consisting of patients seen in a hospital
anticoagulant clinic in 1980. The analysis of the data
was used to "prove" and document the problems
that resulted from substitution of Coumadin (crystalline
warfarin; DuPont Merck, Wilmington, DE) tablets
with Panwarfarin tablets (amorphous warfarin,
now discontinued; Abbott Laboratories, Abbott Park,
IL). Both products contained sodium warfarin of
identical dosage units. During this time, the hospital
pharmacy had switched from Coumadin to Panwarfarin
without the knowledge of the medical personnel.
Some patients apparently had been prescribed a
sufficient quantity of Coumadin so that they were
not switched to Panwarfarin in the course of time of
the retrospective study. Other patients were given
new prescriptions and received the alternate sodium
warfarin (Panwarfarin). The principal analysis was
based on the number or proportion of out-of-range
prothrombin times that occurred during multiple
patient visits to the clinic. The number of patient visits
also was evaluated for the two groups of patients,
those who had or had not switched products.

Fifteen patients were switched to Panwarfarin
(switch group) and 40 patients continued to take
Coumadin (Coumadin group). Two statistical tests
were performed using the erroneous "mixing of samples"
with excessive amounts of data. This should
have been apparent to the authors and the proponents
of Coumadin based on a gross difference of
probability levels between two tests which are
essentially measuring the same event (see below).
What seemed to be the most critical test based on the
data was a comparison of the two groups with respect
to the proportion of visits in which prothrombin
time was in the normal range. The significance
level was 0.001. Based on the previous examples in
the article, the problem is self-evident. In the switch
group 29 visits of 74 resulted in values in the normal
range (39%), whereas in the Coumadin (nonswitch)
group 121 of 177 visits (68%) resulted in values in
the normal range. The level of 0.001 should have
raised a red flag had the authors compared this result
with the nonsignificant result (P = 0.07) of the test
comparing the proportion of patients whose levels
were in the normal range for the two groups. Clearly,
the reason for this apparent contradiction is the
mixing-of-samples problem as delineated by Mainland. 1
It is misleading and just not correct to use all
of the observations, disregarding the important ex- 
perimental units (the patients). 
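
To see how the visit counts alone produce such a small P value, the visit-level comparison can be reproduced approximately. The test used in the original paper is not named here, so the sketch below substitutes a standard two-sided z test for two proportions (the function name is my own). It deliberately repeats the error under discussion by treating every visit as an independent observation. The patient-level counts behind the P = 0.07 result are not given in this article, so that test is not reproduced.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z test for two proportions, assuming independent observations."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value

# In-range visits: 29 of 74 (switch group) versus 121 of 177 (Coumadin group)
z, p_value = two_proportion_z_test(29, 74, 121, 177)

print(f"switch group:   {29 / 74:.0%} of visits in range")
print(f"Coumadin group: {121 / 177:.0%} of visits in range")
print(f"z = {z:.2f}, P = {p_value:.6f} (visits wrongly treated as independent)")
```

With 251 visits standing in for 55 patients, the P value falls well below 0.001, which is the inflation of significance described above.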

Going back to the purpose of this paper, from a 
statistical point of view the erroneous analysis is a 
result of a lack of independence of the observations 
when individual patient results are pooled together 
forming a mass of data. Using an example similar to 
one that has already been presented, we can show 
how analyzing data in this way can misrepresent 
what is happening. Using the overall pooled results 
in the study, let us suppose that one patient in the 
switch group (Coumadin to Panwarfarin) was out of 
range 16 times in 16 visits (100%). Let us further 
suppose that the remaining 13 patients were seen for 
a total of 58 visits and each patient was out of range 
once. This is equivalent to 45 in-range values in 58 
visits (78%). This is a better success rate than that 
observed in the Coumadin group. This demonstrates 
that data from only one extreme patient can bias the 
conclusion if the data from individual patients are 
not examined! In this kind of study, patients are the 
experimental units. When we do not account for 
patients in the analysis, the sample size is dramati- 
cally increased: 15 patients appear as 77, as elo- 
quently described by Colton, 2 resulting in an equally 
apparent dramatic increase in significance (a de- 
crease in the P value). 
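
The hypothetical case above can be written out explicitly. In the sketch below, the extreme patient (16 of 16 visits out of range) and the totals for the remaining 13 patients (58 visits, one out-of-range visit each) follow the text; how the 58 visits are divided among those 13 patients is not stated, so the split used here is an assumption made only so that the calculation can run.

```python
from statistics import median

# (visits, out-of-range visits) per patient in the hypothetical switch group.
# The split of the 58 visits among the 13 remaining patients is assumed.
extreme = (16, 16)
others = [(5, 1)] * 6 + [(4, 1)] * 7
patients = [extreme] + others

def in_range_rate(group):
    """Pooled proportion of visits that were in range for a list of patients."""
    visits = sum(v for v, _ in group)
    out_of_range = sum(o for _, o in group)
    return (visits - out_of_range) / visits

pooled_all = in_range_rate(patients)      # 45 of 74 visits
pooled_others = in_range_rate(others)     # 45 of 58 visits
per_patient = [(v - o) / v for v, o in patients]
typical_patient = median(per_patient)

print(f"pooled, all patients:          {pooled_all:.0%} in range")
print(f"pooled, extreme patient left out: {pooled_others:.0%} in range")
print(f"median of per-patient rates:   {typical_patient:.0%} in range")
```

One patient moves the pooled in-range rate from 78% to 61%, whereas a summary computed patient by patient is barely affected.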

This incorrect analysis was repeated when the 
authors compared the proportion of months during 
which prothrombin times were out of range for pa- 
tients in the switch group before the switch and 
during the switch. It was not clear exactly how this 
was computed, but in any event the conclusions cannot
be considered valid in the absence of a comparative
group (the Coumadin group, those who were
not switched). This is a different problem with the 
analysis, requiring a separate discussion. However, 
with regard to the issue of independent observations 
and mixing of samples, we have the same problem. 
Patients are seen for different periods of time in the 
prestudy and study periods with different numbers 
of visits per month. This particular perplexing prob- 
lem has been noted by both Mainland 1 and Colton. 2 
The unbalanced data set immensely complicates any 
statistical evaluation. 

For example, if a patient is seen once each month 
for 6 months, or six times, the proportion of months 
in which an out-of-range result will be observed will 
almost certainly be smaller than if a patient is seen 
six times in one month. Also, as noted in the original 
paper, there were fewer visits in the prestudy period 
and a longer time span on the average than the poststudy
period. Consider the situation in which a patient
misses some months in the prestudy period:
how would that data be treated? Suppose a patient 
had a visit on the first and last days of a month. How 
would this data be compared with results of visits on 
the first day of the month and the first of the follow- 
ing month? If there was a failure only on the first day 
of the first month, the former patient would have a 
proportion of 100% compared with 50% for the lat- 
ter. 
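
The two-visit example can be made concrete. Because it is not clear how the original paper computed the proportion of months, the sketch below assumes one plausible definition: a month counts as out of range if any visit in that month was out of range, and the denominator is the number of months in which the patient was seen. The function name is hypothetical.

```python
def proportion_of_months_out_of_range(visits):
    """visits: list of (month, out_of_range) pairs for one patient."""
    months_seen = {month for month, _ in visits}
    months_out = {month for month, out in visits if out}
    return len(months_out) / len(months_seen)

# Patient seen on the first and last days of month 1; failure on the first day
same_month = [(1, True), (1, False)]

# Patient seen on the first day of month 1 and of month 2; failure on the first
two_months = [(1, True), (2, False)]

print(f"{proportion_of_months_out_of_range(same_month):.0%}")
print(f"{proportion_of_months_out_of_range(two_months):.0%}")
```

Both patients have one failure in two visits, yet the month-based proportion is 100% for the first and 50% for the second, purely because of when the visits happened to fall.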

When used properly, statistics are clearly a powerful
tool for making decisions. Improper use of statistics
is unfortunate not only because of the inappropriate
conclusions based on incorrect analyses, but
also because of the resulting maligning and bad jokes
that are propagated about the misrepresentation of
data and conclusions by statistical means. Hopefully,
the examples presented here will be useful in preventing
misrepresentation of experimental results based on a
falsely inflated sample size derived from the incorrect
experimental units.